Introduction

This report presents a forecast of High-End Vacuum sales for the Vac-Attack company in December 2020, based on historical data. The analysis aims to determine whether the company will meet its sales target and whether enough stock will be on hand to match demand.

The report covers data cleaning, summary statistics, visualisation and statistical model fitting, implemented in the R programming language.

Installing and loading the required R packages

if (!require(tidyverse))
  install.packages("tidyverse")

if (!require(lubridate))
  install.packages("lubridate")

if (!require(plotly))
  install.packages("plotly")

if (!require(car))
  install.packages("car")

if (!require(stargazer))
  install.packages("stargazer")

if (!require(xgboost))
  install.packages("xgboost")

library(tidyverse)
library(lubridate)
library(stargazer)
library(car)
library(xgboost)

Importing Data

First, import the marketing data.

market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)

market_sale_raw <- readr::read_csv("Data/MarketingSales.csv", 
                                   col_names = market_sale_col$X1)

Then import the December data for prediction.

dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)

dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv", 
                                   col_names = dec_col$X1)

A quick exploration of the imported data sets

dim(market_sale_raw)
## [1] 1796   11
glimpse(market_sale_raw)
## Rows: 1,796
## Columns: 11
## $ Date                   <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month                  <chr> "January", "January", "January", "January", "Ja…
## $ Day                    <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247`         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales                  <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)
## [1] 31  4
glimpse(dec_ad_raw)
## Rows: 31
## Columns: 4
## $ Date             <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month            <chr> "December", "December", "December", "December", "Dece…
## $ Day              <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…

Tidy up the variables of both datasets

The next step is to tidy up both data sets for visualisation and modelling. I have also included year, month and day variables in both factor and numeric formats.

market_sale_final <- 
  market_sale_raw %>% 
  mutate(Date = dmy(Date),
         Ad_Spend =  AdvertisingSpend,
         Phone_24 = factor(`0508Line_247`),
         Positive_News = factor(PositiveNews),
         Negative_News = factor(NegativeCoverage),
         Competition = factor(Competition),
         Ultra_Edition = factor( UltraEdition_Available), 
         COVID_Lockdown = factor(COVID_Lockdown),
         Year = factor(year(Date)),
         Month = factor(Month, levels = month.name),
         Day = factor(Day, levels =
                        c("Monday", "Tuesday", "Wednesday", "Thursday",
                          "Friday", "Saturday", "Sunday")),
         Month_num = as.numeric(Month),
         Year_num = year(Date),
         Day_num = as.numeric(Day),
         Date_num = date(Date)) %>% 
  select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, COVID_Lockdown,
         Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(market_sale_final)
## [1] 1796   16
glimpse(market_sale_final)
## Rows: 1,796
## Columns: 16
## $ Sales          <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend       <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date           <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month          <fct> January, January, January, January, January, January, J…
## $ Day            <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num        <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num       <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num       <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
dec_ad_final <-
  dec_ad_raw %>%
  mutate(
    Date = dmy(Date),
    Ad_Spend =  AdvertisingSpend,
    Phone_24 = factor(0, levels = c(0, 1)),
    Positive_News = factor(0, levels = c(0, 1)),
    Negative_News = factor(0, levels = c(0, 1)),
    Competition  = factor(0, levels = c(0, 1)),
    Ultra_Edition = factor(1, levels = c(0, 1)),
    COVID_Lockdown = factor(0, levels = c(0, 1)),
    Year = factor(2020, levels = market_sale_final$Year |> levels() ),
    Month = factor(Month, levels = month.name),
    Day = factor(Day, levels =
                   c("Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday")),
    Year_num = 2020,
    Month_num = 12,
    Day_num = as.numeric(Day),
    Date_num = date(Date)
  ) %>% 
  select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News, 
         Competition, Ultra_Edition, 
         COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)

dim(dec_ad_final)
## [1] 31 15
glimpse(dec_ad_final)
## Rows: 31
## Columns: 15
## $ Ad_Spend       <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date           <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year           <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month          <fct> December, December, December, December, December, Decem…
## $ Day            <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num        <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num       <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num      <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num       <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…

Data exploration

Data visualisation

I will first generate some plots to look for relationships.

The first plot shows total units sold over time.

ggplot(market_sale_final, aes(x = Date, y = Sales)) +
  geom_path() +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()

There is clearly some seasonality in the total units sold over time, which the modelling will need to account for.

The second plot examines the relationship between total units sold and total advertising spend.

ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
  geom_point()

There is a weak positive correlation between advertising spend and total units sold, with a Pearson correlation of 41.1%.
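The quoted figure can be reproduced directly from the tidied data; a minimal sketch, assuming the market_sale_final data frame from above:

cor(market_sale_final$Ad_Spend, market_sale_final$Sales,
    method = "pearson")
## approximately 0.411, i.e. the 41.1% quoted above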

ggplot(market_sale_final, aes(x = Date)) +
  geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) + 
  geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) + 
  scale_y_continuous(
    name = "The units sold",
    # Add a second axis and specify its features
    sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
  ) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light() +
  theme(
    axis.title.y = element_text(color = "red", size=13),
    axis.title.y.right = element_text(color = "blue", size=13)
  )

hist(market_sale_final$Sales)

The distribution of Sales looks reasonably normal, so there is no need to apply a transformation before modelling.
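As an optional extra check on normality, a normal Q-Q plot makes departures from normality easier to see than the histogram alone; a sketch:

# Q-Q plot: points close to the reference line indicate approximate normality
qqnorm(market_sale_final$Sales)
qqline(market_sale_final$Sales, col = "red")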

Descriptive summary

Total units sold versus years.

market_sale_final %>%
  group_by(Year) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Year Sales_mean Sales_sum
2016 89.48087 32750
2017 82.02740 29940
2018 76.81370 28037
2019 78.07671 28498
2020 66.04179 22124

The total units sold appears to decline from year to year.

Total units sold versus months

market_sale_final %>%
  group_by(Month) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Month Sales_mean Sales_sum
January 66.17419 10257
February 65.78169 9341
March 64.52258 10001
April 59.66000 8949
May 65.76129 10193
June 71.71333 10757
July 71.52258 11086
August 80.06452 12410
September 88.28000 13242
October 91.63226 14203
November 107.91333 16187
December 118.73387 14723

Most of the units sold appear in the later months of the year.

Total units sold versus days of the week.

market_sale_final %>%
  group_by(Day) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Day Sales_mean Sales_sum
Monday 78.26848 20115
Tuesday 78.85156 20186
Wednesday 77.59766 19865
Thursday 78.82422 20179
Friday 79.33852 20390
Saturday 78.82101 20257
Sunday 79.21012 20357

The total units sold appears to be very similar across the days of the week.

market_sale_final %>%
  group_by(Ultra_Edition) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Ultra_Edition Sales_mean Sales_sum
0 81.28650 74052
1 76.04181 67297
market_sale_final %>%
  group_by(Year, Ultra_Edition) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Year Ultra_Edition Sales_mean Sales_sum
2016 0 89.48087 32750
2017 0 82.02740 29940
2018 0 63.12222 11362
2018 1 90.13514 16675
2019 1 78.07671 28498
2020 1 66.04179 22124
market_sale_final %>%
  group_by(Positive_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Positive_News Sales_mean Sales_sum
0 78.22094 135948
1 93.12069 5401
market_sale_final %>%
  group_by(Negative_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales)) %>% 
  knitr::kable()
Negative_News Sales_mean Sales_sum
0 78.85618 140364
1 61.56250 985
market_sale_final %>%
  group_by(Positive_News, Negative_News) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
## `summarise()` has grouped output by 'Positive_News'. You can override using the
## `.groups` argument.
Positive_News Negative_News Sales_mean Sales_sum
0 0 78.37573 134963
0 1 61.56250 985
1 0 93.12069 5401
market_sale_final %>%
  group_by(Competition)  %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Competition Sales_mean Sales_sum
0 80.09143 111247
1 73.96069 30102
market_sale_final %>%
  group_by(COVID_Lockdown) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
COVID_Lockdown Sales_mean Sales_sum
0 80.24484 139947
1 26.96154 1402
market_sale_final %>%
  group_by(Phone_24) %>%
  summarise(Sales_mean = mean(Sales),
            Sales_sum = sum(Sales))%>% 
  knitr::kable()
Phone_24 Sales_mean Sales_sum
0 76.10649 95057
1 84.62888 46292

Modelling using linear regression

Since the distribution of total units sold is roughly normal with a bell-shaped curve, I will start by fitting a multiple linear regression model to the total units sold, as seen below. A multiple linear regression model is a good starting point because it is straightforward to fit and to diagnose, which helps decide whether a more complicated modelling technique is required. Linear regression also makes inference easy, i.e. identifying which predictor variables have statistically significant effects or the greatest influence on the target variable. Finally, the fitted model can be used to predict the target variable from the values of the predictor variables.

Fitting the model

linear_reg_fit <-
  lm(
    Sales ~ Ad_Spend + (Year * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Phone_24 + 
      Positive_News + Negative_News + Competition,
    data = market_sale_final
  )

ANOVA table of the initial model

anova(linear_reg_fit)
## Analysis of Variance Table
## 
## Response: Sales
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## Ad_Spend          1 151427  151427 5321.5665 < 2.2e-16 ***
## Year              4  94608   23652  831.1967 < 2.2e-16 ***
## Month            11 517817   47074 1654.3260 < 2.2e-16 ***
## COVID_Lockdown    1  45137   45137 1586.2609 < 2.2e-16 ***
## Ultra_Edition     1    301     301   10.5642  0.001181 ** 
## Phone_24          1  10715   10715  376.5428 < 2.2e-16 ***
## Positive_News     1   9672    9672  339.8899 < 2.2e-16 ***
## Negative_News     1   7108    7108  249.8029 < 2.2e-16 ***
## Competition       1    109     109    3.8149  0.051001 .  
## Year:Month       42  13192     314   11.0382 < 2.2e-16 ***
## Year:Month:Day  354   8297      23    0.8236  0.987476    
## Residuals      1377  39183      28                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choose a model by AIC in a Stepwise Algorithm

linear_reg_fit_final <- step(linear_reg_fit) 
## Start:  AIC=6374.49
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Competition
## 
## 
## Step:  AIC=6374.49
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month + 
##     Year:Month:Day
## 
##                   Df Sum of Sq    RSS    AIC
## - Year:Month:Day 354      8297  47479 6011.4
## - Ultra_Edition    1        21  39204 6373.5
## - Competition      1        39  39222 6374.3
## <none>                          39183 6374.5
## - Negative_News    1      6008  45191 6628.7
## - Positive_News    1      7675  46858 6693.7
## - COVID_Lockdown   1     15661  54844 6976.4
## - Ad_Spend         1    114379 153562 8825.6
## 
## Step:  AIC=6011.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## - Ultra_Edition   1        23  47502 6010.3
## - Competition     1        30  47510 6010.6
## <none>                         47479 6011.4
## - Negative_News   1      8015  55494 6289.6
## - Positive_News   1      9093  56572 6324.1
## - COVID_Lockdown  1     15965  63444 6530.0
## - Year:Month     43     22732  70212 6628.0
## - Ad_Spend        1    135383 182862 8431.2
## 
## Step:  AIC=6010.29
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Competition + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## - Competition     1        30  47532 6009.4
## <none>                         47502 6010.3
## - Negative_News   1      8015  55517 6288.3
## - Positive_News   1      9085  56588 6322.6
## - COVID_Lockdown  1     15965  63467 6528.7
## - Year:Month     43     23191  70694 6638.3
## - Ad_Spend        1    135379 182881 8429.4
## 
## Step:  AIC=6009.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Year:Month
## 
##                  Df Sum of Sq    RSS    AIC
## <none>                         47532 6009.4
## - Negative_News   1      8015  55547 6287.3
## - Positive_News   1      9081  56614 6321.4
## - COVID_Lockdown  1     15964  63497 6527.5
## - Year:Month     43     23830  71363 6653.3
## - Ad_Spend        1    135661 183193 8430.5

ANOVA table of the final model

anova(linear_reg_fit_final) 
## Analysis of Variance Table
## 
## Response: Sales
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## Ad_Spend          1 151427  151427 5520.900 < 2.2e-16 ***
## Year              4  94608   23652  862.331 < 2.2e-16 ***
## Month            11 517817   47074 1716.293 < 2.2e-16 ***
## COVID_Lockdown    1  45137   45137 1645.678 < 2.2e-16 ***
## Positive_News     1   9692    9692  353.375 < 2.2e-16 ***
## Negative_News     1   7520    7520  274.174 < 2.2e-16 ***
## Year:Month       43  23830     554   20.205 < 2.2e-16 ***
## Residuals      1733  47532      27                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Parameter estimates

summary(linear_reg_fit_final)  
## 
## Call:
## lm(formula = Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + 
##     Positive_News + Negative_News + Year:Month, data = market_sale_final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6621  -3.6102  -0.2807   3.3290  27.3765 
## 
## Coefficients: (1 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.565e+01  9.609e-01  68.324  < 2e-16 ***
## Ad_Spend                 1.416e-03  2.014e-05  70.328  < 2e-16 ***
## Year2017                 1.754e+00  1.331e+00   1.318 0.187765    
## Year2018                -1.706e+01  1.331e+00 -12.817  < 2e-16 ***
## Year2019                -2.031e+01  1.331e+00 -15.255  < 2e-16 ***
## Year2020                -2.133e+01  1.331e+00 -16.025  < 2e-16 ***
## MonthFebruary           -1.076e+00  1.355e+00  -0.794 0.427270    
## MonthMarch               2.185e-01  1.331e+00   0.164 0.869649    
## MonthApril              -1.851e+00  1.342e+00  -1.379 0.168043    
## MonthMay                -2.427e+00  1.331e+00  -1.824 0.068390 .  
## MonthJune                1.771e+00  1.342e+00   1.320 0.186943    
## MonthJuly                9.950e-01  1.330e+00   0.748 0.454593    
## MonthAugust              9.419e+00  1.332e+00   7.069 2.26e-12 ***
## MonthSeptember           1.897e+01  1.341e+00  14.143  < 2e-16 ***
## MonthOctober             2.745e+01  1.332e+00  20.605  < 2e-16 ***
## MonthNovember            4.178e+01  1.342e+00  31.126  < 2e-16 ***
## MonthDecember            4.905e+01  1.331e+00  36.860  < 2e-16 ***
## COVID_Lockdown1         -3.483e+01  1.444e+00 -24.125  < 2e-16 ***
## Positive_News1           1.290e+01  7.090e-01  18.196  < 2e-16 ***
## Negative_News1          -2.277e+01  1.332e+00 -17.094  < 2e-16 ***
## Year2017:MonthFebruary  -1.025e+01  1.926e+00  -5.323 1.16e-07 ***
## Year2018:MonthFebruary   7.149e+00  1.924e+00   3.715 0.000210 ***
## Year2019:MonthFebruary   3.171e+00  1.924e+00   1.649 0.099408 .  
## Year2020:MonthFebruary   5.005e+00  1.914e+00   2.615 0.008990 ** 
## Year2017:MonthMarch     -3.546e+00  1.883e+00  -1.883 0.059855 .  
## Year2018:MonthMarch      7.306e-01  1.883e+00   0.388 0.698001    
## Year2019:MonthMarch      2.308e+00  1.882e+00   1.226 0.220208    
## Year2020:MonthMarch      5.374e+00  1.910e+00   2.813 0.004961 ** 
## Year2017:MonthApril     -6.831e+00  1.898e+00  -3.599 0.000329 ***
## Year2018:MonthApril      4.347e+00  1.898e+00   2.291 0.022106 *  
## Year2019:MonthApril      8.629e+00  1.898e+00   4.546 5.86e-06 ***
## Year2020:MonthApril      8.313e+00  2.385e+00   3.486 0.000502 ***
## Year2017:MonthMay        1.501e+00  1.882e+00   0.797 0.425282    
## Year2018:MonthMay        6.937e+00  1.882e+00   3.686 0.000235 ***
## Year2019:MonthMay        1.105e+01  1.882e+00   5.870 5.22e-09 ***
## Year2020:MonthMay        1.115e+01  2.007e+00   5.555 3.21e-08 ***
## Year2017:MonthJune      -4.719e+00  1.898e+00  -2.487 0.012975 *  
## Year2018:MonthJune       3.308e+00  1.897e+00   1.743 0.081469 .  
## Year2019:MonthJune       1.280e+01  1.900e+00   6.740 2.14e-11 ***
## Year2020:MonthJune       8.612e+00  1.898e+00   4.538 6.06e-06 ***
## Year2017:MonthJuly      -1.327e+01  1.882e+00  -7.050 2.57e-12 ***
## Year2018:MonthJuly       8.134e+00  1.881e+00   4.324 1.62e-05 ***
## Year2019:MonthJuly       1.871e+01  1.881e+00   9.947  < 2e-16 ***
## Year2020:MonthJuly       8.028e+00  1.882e+00   4.265 2.11e-05 ***
## Year2017:MonthAugust    -9.244e+00  1.885e+00  -4.904 1.02e-06 ***
## Year2018:MonthAugust     5.231e+00  1.884e+00   2.776 0.005563 ** 
## Year2019:MonthAugust     1.030e+01  1.883e+00   5.472 5.09e-08 ***
## Year2020:MonthAugust     1.518e+01  1.885e+00   8.052 1.50e-15 ***
## Year2017:MonthSeptember -1.093e+01  1.897e+00  -5.760 9.95e-09 ***
## Year2018:MonthSeptember  1.773e+00  1.897e+00   0.935 0.350071    
## Year2019:MonthSeptember  1.298e+01  1.898e+00   6.839 1.10e-11 ***
## Year2020:MonthSeptember  1.043e+01  1.897e+00   5.496 4.46e-08 ***
## Year2017:MonthOctober   -1.429e+01  1.884e+00  -7.584 5.43e-14 ***
## Year2018:MonthOctober    2.755e+00  1.883e+00   1.463 0.143639    
## Year2019:MonthOctober    5.206e+00  1.885e+00   2.762 0.005803 ** 
## Year2020:MonthOctober    8.350e+00  1.882e+00   4.437 9.69e-06 ***
## Year2017:MonthNovember  -1.919e+01  1.897e+00 -10.118  < 2e-16 ***
## Year2018:MonthNovember   7.326e+00  1.898e+00   3.860 0.000117 ***
## Year2019:MonthNovember   1.230e+01  1.898e+00   6.482 1.18e-10 ***
## Year2020:MonthNovember   4.986e+00  1.899e+00   2.626 0.008728 ** 
## Year2017:MonthDecember  -1.355e+01  1.882e+00  -7.202 8.83e-13 ***
## Year2018:MonthDecember   6.887e+00  1.884e+00   3.656 0.000264 ***
## Year2019:MonthDecember   9.716e+00  1.882e+00   5.161 2.73e-07 ***
## Year2020:MonthDecember          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.237 on 1733 degrees of freedom
## Multiple R-squared:  0.947,  Adjusted R-squared:  0.9451 
## F-statistic: 499.9 on 62 and 1733 DF,  p-value: < 2.2e-16
sum_stats <- summary(linear_reg_fit_final)

Note that this multiple linear regression model has an adjusted R-squared of 0.945, which means it explains about 94.5% of the variance in units sold, indicating a very good fit to the training data.

Model diagnostics

hist(linear_reg_fit_final$residuals)

plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final) )

# calculate the mean squared error on the training data
mse_linear <- 
  mean((market_sale_final$Sales - 
          predict(linear_reg_fit_final,
                  newdata = market_sale_final, 
                  type = "response"))^2)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

The AIC of the linear regression model is approximately 1.11 × 10^4. The histogram of the residuals is relatively normal, and the residual plot is scattered randomly around zero.
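For reference, the fit statistics quoted in the text are available directly from the fitted objects defined above; a minimal sketch:

AIC(linear_reg_fit_final)    # Akaike information criterion
sum_stats$adj.r.squared      # adjusted R-squared
mse_linear                   # training mean squared error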

Modelling using Poisson regression

Poisson regression is often used for modelling count data, such as daily units sold. Thus, below are the modelling results using Poisson regression analysis.

Fitting the model

poisson_reg_fit <-
  glm(
    Sales ~ Ad_Spend + (Year * Month) / Day +
      COVID_Lockdown + Ultra_Edition + Phone_24 + 
      Positive_News + Negative_News + Competition,
    data = market_sale_final,
    family = poisson()
)

ANOVA table of the initial model

car::Anova(poisson_reg_fit) 
## Analysis of Deviance Table (Type II tests)
## 
## Response: Sales
##                LR Chisq  Df Pr(>Chisq)    
## Ad_Spend         1364.9   1  < 2.2e-16 ***
## Year               15.4   4   0.003935 ** 
## Month            3501.8  11  < 2.2e-16 ***
## COVID_Lockdown    356.0   1  < 2.2e-16 ***
## Ultra_Edition       0.4   1   0.506484    
## Phone_24                  0               
## Positive_News      88.1   1  < 2.2e-16 ***
## Negative_News      80.1   1  < 2.2e-16 ***
## Competition         0.3   1   0.594721    
## Year:Month        216.2  42  < 2.2e-16 ***
## Year:Month:Day    131.5 354   1.000000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Choose a model by AIC in a Stepwise Algorithm

poisson_reg_fit_final <- step(poisson_reg_fit)
## Start:  AIC=12543.89
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition + 
##     Phone_24 + Positive_News + Negative_News + Competition
## 
## 
## Step:  AIC=12543.89
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month + 
##     Year:Month:Day
## 
##                   Df Deviance   AIC
## - Year:Month:Day 354   772.78 11967
## - Competition      1   641.53 12542
## - Ultra_Edition    1   641.69 12542
## <none>                 641.25 12544
## - Negative_News    1   721.35 12622
## - Positive_News    1   729.31 12630
## - COVID_Lockdown   1   997.21 12898
## - Ad_Spend         1  2006.15 13907
## 
## Step:  AIC=11967.41
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Competition + Year:Month
## 
##                  Df Deviance   AIC
## - Competition     1   772.98 11966
## - Ultra_Edition   1   773.15 11966
## <none>                772.78 11967
## - Negative_News   1   875.60 12068
## - Positive_News   1   877.81 12070
## - Year:Month     43  1106.23 12215
## - COVID_Lockdown  1  1130.86 12324
## - Ad_Spend        1  2394.15 13587
## 
## Step:  AIC=11965.61
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition + 
##     Positive_News + Negative_News + Year:Month
## 
##                  Df Deviance   AIC
## - Ultra_Edition   1   773.35 11964
## <none>                772.98 11966
## - Negative_News   1   875.80 12066
## - Positive_News   1   877.97 12069
## - Year:Month     43  1122.66 12229
## - COVID_Lockdown  1  1131.06 12322
## - Ad_Spend        1  2397.68 13588
## 
## Step:  AIC=11963.99
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News + 
##     Negative_News + Year:Month
## 
##                  Df Deviance   AIC
## <none>                773.35 11964
## - Negative_News   1   876.17 12065
## - Positive_News   1   878.24 12067
## - Year:Month     43  1135.63 12240
## - COVID_Lockdown  1  1131.43 12320
## - Ad_Spend        1  2397.95 13587

ANOVA table of the final model

car::Anova(poisson_reg_fit_final)
## Analysis of Deviance Table (Type II tests)
## 
## Response: Sales
##                LR Chisq Df Pr(>Chisq)    
## Ad_Spend         1624.6  1  < 2.2e-16 ***
## Year              537.0  4  < 2.2e-16 ***
## Month            5536.1 11  < 2.2e-16 ***
## COVID_Lockdown    358.1  1  < 2.2e-16 ***
## Positive_News     104.9  1  < 2.2e-16 ***
## Negative_News     102.8  1  < 2.2e-16 ***
## Year:Month        362.3 43  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model diagnostics

hist(poisson_reg_fit_final$residuals)

plot(fitted(poisson_reg_fit_final), resid(poisson_reg_fit_final))

# calculate the mean squared error on the training data
mse_poisson <- 
  mean((market_sale_final$Sales - 
          predict(poisson_reg_fit_final,
                  newdata = market_sale_final, 
                  type = "response"))^2)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

Both the AIC (approximately 1.20 × 10^4, versus 1.11 × 10^4 above) and the training mean squared error from the Poisson regression model are slightly worse than the corresponding estimates from the linear regression.
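To make this comparison explicit, the fit statistics of the two models can be collected into one table; a sketch, assuming mse_linear and mse_poisson were computed in the diagnostic chunks above:

# side-by-side comparison of the two regression models
tibble(
  Model = c("Linear regression", "Poisson regression"),
  AIC   = c(AIC(linear_reg_fit_final), AIC(poisson_reg_fit_final)),
  MSE   = c(mse_linear, mse_poisson)
) %>% 
  knitr::kable()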

Modelling using eXtreme Gradient Boosting (XGBoost)

Finally, I will fit an eXtreme Gradient Boosting (XGBoost) model to see whether it achieves an even smaller mean squared error, making the total sales prediction more accurate and robust. When preparing the data for model training below, I did not split the data into training and testing sets. This is because the two models above both compute their mean squared error on the full training data, so fitting on the same data keeps the comparison of mean squared errors consistent across models.

market_sale_final_mat <- 
  market_sale_final %>% 
    select(Sales, Ad_Spend, Phone_24, Positive_News, Negative_News, Competition,
           Ultra_Edition, COVID_Lockdown, Year_num, Month_num, Day_num) %>% 
    mutate(Year_month = paste0(Year_num, Month_num)) %>%
    as.matrix() 

market_sale_final_mat <- 
  apply(market_sale_final_mat, 2, as.numeric)

train <- market_sale_final_mat

# create a DMatrix object for the training set
dtrain <- xgb.DMatrix(data = train[, -1], label = train[, "Sales"])

# specify the model parameters
params <- list(booster = "gbtree", objective = "reg:squarederror", 
               eta = 0.3, max_depth = 3, subsample = 0.8,
               colsample_bytree = 0.8, nthread = 2)

# candidate hyperparameter values for tuning
# (xgb.cv has no built-in grid-search argument; loop over these combinations manually)
gridsearch_params <- list(max_depth = c(2, 4, 6), min_child_weight = c(1, 3, 5))

# five-fold cross-validation with early stopping
set.seed(1)
cv_result <- 
  xgb.cv(params = params, data = dtrain, nfold = 5, nrounds = 100,
         early_stopping_rounds = 10, metrics = "rmse",
         verbose = TRUE, maximize = FALSE)
## [1]  train-rmse:57.852969+0.476198   test-rmse:57.808562+0.418858 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [2]  train-rmse:41.295639+0.309401   test-rmse:41.291347+0.505362 
## [3]  train-rmse:29.748447+0.228122   test-rmse:29.765895+0.475301 
## [4]  train-rmse:21.845696+0.196288   test-rmse:21.978830+0.429040 
## [5]  train-rmse:16.445210+0.128355   test-rmse:16.635187+0.368547 
## [6]  train-rmse:12.747118+0.158243   test-rmse:13.047995+0.345544 
## [7]  train-rmse:10.404637+0.179678   test-rmse:10.741460+0.333321 
## [8]  train-rmse:8.843113+0.111802    test-rmse:9.243221+0.306017 
## [9]  train-rmse:7.797505+0.063773    test-rmse:8.272533+0.278479 
## [10] train-rmse:7.168529+0.099072    test-rmse:7.700926+0.239920 
## [11] train-rmse:6.779529+0.095536    test-rmse:7.327917+0.220856 
## [12] train-rmse:6.484708+0.106734    test-rmse:7.035060+0.215553 
## [13] train-rmse:6.278815+0.121913    test-rmse:6.824892+0.214558 
## [14] train-rmse:6.121293+0.103468    test-rmse:6.684477+0.217459 
## [15] train-rmse:6.008845+0.105604    test-rmse:6.587081+0.230001 
## [16] train-rmse:5.914447+0.095200    test-rmse:6.483416+0.248313 
## [17] train-rmse:5.831796+0.083372    test-rmse:6.413739+0.266887 
## [18] train-rmse:5.762734+0.078059    test-rmse:6.362844+0.254265 
## [19] train-rmse:5.709274+0.075766    test-rmse:6.314493+0.259974 
## [20] train-rmse:5.659984+0.076543    test-rmse:6.268622+0.250437 
## [21] train-rmse:5.614794+0.069840    test-rmse:6.238095+0.251149 
## [22] train-rmse:5.568781+0.088766    test-rmse:6.213908+0.238872 
## [23] train-rmse:5.529566+0.089851    test-rmse:6.177584+0.234247 
## [24] train-rmse:5.487238+0.079895    test-rmse:6.129675+0.236513 
## [25] train-rmse:5.457495+0.075045    test-rmse:6.120602+0.241391 
## [26] train-rmse:5.420269+0.072538    test-rmse:6.084610+0.250495 
## [27] train-rmse:5.394824+0.072564    test-rmse:6.074964+0.247039 
## [28] train-rmse:5.376932+0.069403    test-rmse:6.060184+0.242153 
## [29] train-rmse:5.351277+0.072629    test-rmse:6.055778+0.246861 
## [30] train-rmse:5.319740+0.067331    test-rmse:6.030624+0.250858 
## [31] train-rmse:5.298846+0.065196    test-rmse:6.021959+0.251812 
## [32] train-rmse:5.275055+0.058685    test-rmse:5.992801+0.249521 
## [33] train-rmse:5.256347+0.061291    test-rmse:5.986317+0.242775 
## [34] train-rmse:5.234787+0.059736    test-rmse:5.971700+0.244410 
## [35] train-rmse:5.209314+0.057399    test-rmse:5.957965+0.237388 
## [36] train-rmse:5.193078+0.052812    test-rmse:5.957135+0.237159 
## [37] train-rmse:5.166185+0.052154    test-rmse:5.947602+0.233346 
## [38] train-rmse:5.151848+0.054320    test-rmse:5.948301+0.234458 
## [39] train-rmse:5.135561+0.052012    test-rmse:5.946514+0.228293 
## [40] train-rmse:5.121800+0.052016    test-rmse:5.948655+0.233832 
## [41] train-rmse:5.104253+0.048436    test-rmse:5.929761+0.236392 
## [42] train-rmse:5.093572+0.047828    test-rmse:5.928085+0.232895 
## [43] train-rmse:5.082788+0.047241    test-rmse:5.926136+0.235880 
## [44] train-rmse:5.066347+0.047106    test-rmse:5.928291+0.246709 
## [45] train-rmse:5.053846+0.043688    test-rmse:5.923201+0.246024 
## [46] train-rmse:5.037406+0.045706    test-rmse:5.921876+0.247585 
## [47] train-rmse:5.019601+0.047767    test-rmse:5.910711+0.234497 
## [48] train-rmse:5.000153+0.051261    test-rmse:5.917173+0.226700 
## [49] train-rmse:4.983910+0.053864    test-rmse:5.904486+0.215948 
## [50] train-rmse:4.967404+0.052445    test-rmse:5.895665+0.210841 
## [51] train-rmse:4.958440+0.049448    test-rmse:5.895893+0.214413 
## [52] train-rmse:4.942164+0.051068    test-rmse:5.898594+0.201320 
## [53] train-rmse:4.929109+0.049666    test-rmse:5.905408+0.196692 
## [54] train-rmse:4.914176+0.049694    test-rmse:5.907152+0.207105 
## [55] train-rmse:4.897802+0.047627    test-rmse:5.890759+0.212987 
## [56] train-rmse:4.887218+0.045709    test-rmse:5.887277+0.219786 
## [57] train-rmse:4.875469+0.044760    test-rmse:5.888478+0.220310 
## [58] train-rmse:4.860128+0.048115    test-rmse:5.889477+0.234508 
## [59] train-rmse:4.851516+0.047451    test-rmse:5.884766+0.236204 
## [60] train-rmse:4.839195+0.043259    test-rmse:5.882277+0.231871 
## [61] train-rmse:4.827321+0.042784    test-rmse:5.884144+0.237330 
## [62] train-rmse:4.818399+0.044790    test-rmse:5.893013+0.228989 
## [63] train-rmse:4.808529+0.042725    test-rmse:5.896339+0.228086 
## [64] train-rmse:4.793826+0.037079    test-rmse:5.895039+0.230940 
## [65] train-rmse:4.782079+0.036894    test-rmse:5.894319+0.234007 
## [66] train-rmse:4.769924+0.037295    test-rmse:5.890745+0.227343 
## [67] train-rmse:4.757866+0.041325    test-rmse:5.885826+0.219757 
## [68] train-rmse:4.745465+0.036225    test-rmse:5.879309+0.217778 
## [69] train-rmse:4.732925+0.034829    test-rmse:5.876815+0.217111 
## [70] train-rmse:4.719882+0.033528    test-rmse:5.877755+0.218805 
## [71] train-rmse:4.711788+0.030563    test-rmse:5.874746+0.212870 
## [72] train-rmse:4.695824+0.029547    test-rmse:5.870732+0.211965 
## [73] train-rmse:4.686349+0.031026    test-rmse:5.871659+0.218814 
## [74] train-rmse:4.676199+0.029560    test-rmse:5.871074+0.221318 
## [75] train-rmse:4.667050+0.030636    test-rmse:5.872532+0.219028 
## [76] train-rmse:4.656570+0.030979    test-rmse:5.874747+0.213028 
## [77] train-rmse:4.647919+0.031589    test-rmse:5.874546+0.206055 
## [78] train-rmse:4.638765+0.029053    test-rmse:5.875564+0.201671 
## [79] train-rmse:4.626832+0.029337    test-rmse:5.868786+0.198773 
## [80] train-rmse:4.618170+0.028152    test-rmse:5.864057+0.201775 
## [81] train-rmse:4.602715+0.027356    test-rmse:5.867879+0.208803 
## [82] train-rmse:4.591082+0.026813    test-rmse:5.868856+0.208424 
## [83] train-rmse:4.579448+0.026370    test-rmse:5.873090+0.201267 
## [84] train-rmse:4.569218+0.027652    test-rmse:5.872276+0.197501 
## [85] train-rmse:4.562559+0.026343    test-rmse:5.872017+0.200561 
## [86] train-rmse:4.554654+0.026723    test-rmse:5.864397+0.202739 
## [87] train-rmse:4.543161+0.026600    test-rmse:5.866221+0.206257 
## [88] train-rmse:4.533439+0.022644    test-rmse:5.867745+0.202166 
## [89] train-rmse:4.527212+0.020910    test-rmse:5.866962+0.199708 
## [90] train-rmse:4.519951+0.020538    test-rmse:5.868707+0.194761 
## Stopping. Best iteration:
## [80] train-rmse:4.618170+0.028152    test-rmse:5.864057+0.201775
# update the model parameters with the optimal values
params$max_depth <- cv_result$best_param["max_depth"]
params$min_child_weight <- cv_result$best_param["min_child_weight"]

# train the final model using the optimal parameters and the number of
# rounds selected by cross-validation (best iteration: 80)
xgboostModel <- xgb.train(params, dtrain, nrounds = 80)
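The optimal number of boosting rounds reported by the cross-validation (iteration 80 above) can also be read programmatically rather than hard-coded. A minimal sketch on toy data, assuming the standard `xgboost::xgb.cv` interface with early stopping and its `best_iteration` field:

```r
library(xgboost)

set.seed(1)
# toy regression data: 4 features, a known linear signal plus noise
x <- matrix(rnorm(500 * 4), ncol = 4)
y <- as.numeric(x %*% c(2, -1, 0.5, 0) + rnorm(500, sd = 0.1))
dtoy <- xgb.DMatrix(x, label = y)

# cross-validate with early stopping; best_iteration records the
# round with the lowest mean test RMSE
cv <- xgb.cv(params = list(objective = "reg:squarederror", eta = 0.3),
             data = dtoy, nrounds = 200, nfold = 5,
             early_stopping_rounds = 10, verbose = FALSE)

# train the final model for exactly that many rounds
final <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.3),
                   data = dtoy, nrounds = cv$best_iteration)
```

This avoids the model silently under- or over-fitting when the CV log changes between runs.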

Feature Importance of the model

xgb.importance(colnames(train[,-1]), model = xgboostModel)
##            Feature         Gain       Cover  Frequency
##  1:      Month_num 0.6336764121 0.266527793 0.17692308
##  2:       Ad_Spend 0.1868337666 0.331268501 0.39230769
##  3: COVID_Lockdown 0.1047144240 0.078431641 0.05769231
##  4:       Year_num 0.0273856721 0.059519241 0.04230769
##  5:       Phone_24 0.0161178323 0.029560903 0.02692308
##  6:     Year_month 0.0127830274 0.086654424 0.12692308
##  7:  Positive_News 0.0086914814 0.064768118 0.03076923
##  8:  Negative_News 0.0072239186 0.046677996 0.07692308
##  9:    Competition 0.0011799283 0.018556079 0.02307692
## 10:  Ultra_Edition 0.0009779711 0.011128166 0.01538462
## 11:        Day_num 0.0004155663 0.006907137 0.03076923
# generate predictions on the training data
predictions <- predict(xgboostModel, dtrain)

# calculate the mean squared error of the training predictions
mse_xgboost <- mean((train[, "Sales"] - predictions)^2)

The mean squared error of this XGBoost machine learning model is `mse_xgboost`, which is much higher than that of the linear regression model (1.7406663 × 10^-19). Therefore, I will use the multiple linear regression model above to make the final prediction.

Final results

As noted above, the multiple linear regression model performed the best compared to the Poisson regression and XGBoost models, so it is used to generate the December 2020 forecast below.

linear_reg_predict <- 
  predict.lm(linear_reg_fit_final,
           newdata = dec_ad_final, 
           type = "response",
           se.fit = TRUE) 
## Warning in predict.lm(linear_reg_fit_final, newdata = dec_ad_final, type =
## "response", : prediction from a rank-deficient fit may be misleading
dec_ad_final$Sales <- round(linear_reg_predict$fit)


dec_ad_final %>% 
  select(Date, Sales) %>% 
  knitr::kable()
Date Sales
2020-12-01 108
2020-12-02 105
2020-12-03 100
2020-12-04 105
2020-12-05 124
2020-12-06 94
2020-12-07 111
2020-12-08 106
2020-12-09 116
2020-12-10 95
2020-12-11 99
2020-12-12 112
2020-12-13 123
2020-12-14 102
2020-12-15 100
2020-12-16 109
2020-12-17 105
2020-12-18 98
2020-12-19 106
2020-12-20 121
2020-12-21 98
2020-12-22 126
2020-12-23 103
2020-12-24 116
2020-12-25 94
2020-12-26 113
2020-12-27 95
2020-12-28 94
2020-12-29 135
2020-12-30 111
2020-12-31 107

The linear regression model predicts that total unit sales in December 2020 will be 3331, with a 95% confidence interval of 3231 to 3430. Thus, the model suggests that the sales target of 3900 units in December 2020 is unlikely to be met.
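The total and its interval follow from the daily predictions. A minimal sketch with made-up daily fitted values and standard errors (not the report's actual numbers), assuming the daily prediction errors are treated as independent so their variances add:

```r
# toy daily predictions: fitted values and standard errors (made-up)
fit <- c(108, 105, 100, 124)
se  <- c(4.2, 3.9, 4.0, 4.5)

# total forecast is the sum of the daily fitted values
total <- sum(fit)

# under the independence assumption, variances add, so the standard
# error of the total is the root of the summed squared daily SEs
se_total <- sqrt(sum(se^2))

# approximate 95% confidence interval for the total
ci <- round(total + c(-1.96, 1.96) * se_total)
```

If the daily errors are positively correlated (plausible for daily sales), this interval understates the true uncertainty.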

The interactive plot below shows both the historical (green) and forecast (red) sales with 95% confidence intervals. Hover the mouse cursor over the plot to see the underlying estimates.

market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecast"


dec_ad_final$Sales_max <- with(linear_reg_predict, fit + 1.96*se.fit) |> round()
dec_ad_final$Sales_min <- with(linear_reg_predict, fit - 1.96*se.fit) |> round()

market_sale_final$Sales_max <- 
  market_sale_final$Sales_min <- 
  market_sale_final$Sales

g <- 
  market_sale_final %>% 
  bind_rows(dec_ad_final) %>% 
  mutate(Type = factor(Type)) %>% 
  ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
  geom_path() +
  geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
  scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01), 
                 date_labels = "%b %Y") + 
  theme_light()


plotly::ggplotly(g)  %>% plotly::hide_legend()

Notes

In conclusion, multiple linear regression appears to perform the best here.

The model explains 94.51% of the variation in total units sold and should therefore provide accurate predictions.

An alternative method is a time-series analysis with the forecast R package, which can better handle autocorrelation, seasonality, and other temporal patterns in the data. However, the methods in the forecast package require the dataset to be converted to a time-series (ts) object first. This conversion can be complicated, especially when other predictors such as "Total Advertising Spend" must be included in the model for inference and prediction.
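As an illustration of that idea, the sketch below uses base R's `stats::arima` (in place of `forecast::auto.arima`, so the example is self-contained) to fit a regression with ARMA errors on simulated daily sales, with advertising spend as an external regressor. All data here are synthetic and the ARMA order is fixed by hand; `forecast::auto.arima` would select it automatically.

```r
set.seed(42)

# simulated daily sales with weekly seasonality plus an advertising effect
n <- 120
ad_spend <- runif(n, 3000, 15000)
sales <- 100 + 5 * sin(2 * pi * seq_len(n) / 7) +
  0.001 * ad_spend + rnorm(n, sd = 3)

# convert to a ts object with weekly frequency, as forecast-style
# methods require
sales_ts <- ts(sales, frequency = 7)

# regression with AR(1) errors; ad_spend enters via xreg
fit <- arima(sales_ts, order = c(1, 0, 0), xreg = ad_spend)

# forecast the next 7 days given a planned advertising budget
future_ad <- rep(10000, 7)
fc <- predict(fit, n.ahead = 7, newxreg = future_ad)
```

`fc$pred` holds the point forecasts and `fc$se` the standard errors, from which prediction intervals can be built as above.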